Business Question

1. Predicting the likelihood of a diabetic patient being readmitted (based on 70,000 medical records from 130 US hospitals, 1999-2008)
2. Chemical medication vs. biological medication
3. Which patients have the higher probability of a high HbA1c

Data understanding

dData%>%
knitr::kable(caption = "Data variables description", digits = 3)%>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE,position = "left")
## Warning in kableExtra::kable_styling(., bootstrap_options = "striped",
## full_width = FALSE, : Please specify format in kable. kableExtra can customize
## either HTML or LaTeX outputs. See https://haozhu233.github.io/kableExtra/ for
## details.
Data variables description
Variable description
encounter_id Id given during visit of patient
patient_nbr Patient number
race Race of the patient
gender Gender of patient
age age of patient
weight weight of patient
admission_type_id 1=Emergency
2=Urgent
3=Elective
4=Newborn
5=Not Available
6=NULL
7=Trauma Center
8=Not Mapped
discharge_disposition_id 1=Discharged to home
2=Discharged/transferred to another short term hospital
3=Discharged/transferred to SNF
4=Discharged/transferred to ICF
5=Discharged/transferred to another type of inpatient care institution
6=Discharged/transferred to home with home health service
7=Left AMA
8=Discharged/transferred to home under care of Home IV provider
9=Admitted as an inpatient to this hospital
10=Neonate discharged to another hospital for neonatal aftercare
11=Expired
12=Still patient or expected to return for outpatient services
13=Hospice / home
14=Hospice / medical facility
15=Discharged/transferred within this institution to Medicare approved swing bed
16=Discharged/transferred/referred another institution for outpatient services
17=Discharged/transferred/referred to this institution for outpatient services
18=NULL
19=Expired at home. Medicaid only, hospice.
20=Expired in a medical facility. Medicaid only, hospice.
21=Expired, place unknown. Medicaid only, hospice.
22=Discharged/transferred to another rehab fac including rehab units of a hospital .
23=Discharged/transferred to a long term care hospital.
24=Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare.
25=Not Mapped
26=Unknown/Invalid
27=Discharged/transferred to a federal health care facility.
28=Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital
29=Discharged/transferred to a Critical Access Hospital (CAH).
30=Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere
admission_source_id 1= Physician Referral
2=Clinic Referral
3=HMO Referral
4=Transfer from a hospital
5= Transfer from a Skilled Nursing Facility (SNF)
6= Transfer from another health care facility
7= Emergency Room
8= Court/Law Enforcement
9= Not Available
10= Transfer from critical access hospital
11=Normal Delivery
12= Premature Delivery
13= Sick Baby
14= Extramural Birth
15=Not Available
17=NULL
18= Transfer From Another Home Health Agency
19=Readmission to Same Home Health Agency
20= Not Mapped
21=Unknown/Invalid
22= Transfer from hospital inpt/same fac reslt in a sep claim
23= Born inside this hospital
24= Born outside this hospital
25= Transfer from Ambulatory Surgery Center
26=Transfer from Hospice
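time_in_hospital Number of days between admission and discharge
payer_code Payer payment code
medical_specialty Area/field of medicine
num_lab_procedures Number of lab tests performed during the encounter
num_procedures Number of procedures (other than lab tests) performed during the encounter
num_medications Number of distinct medications administered during the encounter
number_emergency Number of emergency visits in the year preceding the encounter
number_outpatient Number of outpatient visits in the year preceding the encounter
number_inpatient Number of inpatient visits in the year preceding the encounter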
diag_1 diagnosis 1
diag_2 diagnosis 2
diag_3 diagnosis 3
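number_diagnoses Number of diagnoses recorded
max_glu_serum Glucose serum test result (>200, >300, Norm, or None if not measured)
A1Cresult HbA1c (hemoglobin A1c) test result, which reflects the blood sugar level (>7, >8, Norm, or None if not measured)
metformin Medication feature: No, Steady, Up, or Down (dosage change)
repaglinide Medication feature: No, Steady, Up, or Down (dosage change)
nateglinide Medication feature: No, Steady, Up, or Down (dosage change)
chlorpropamide Medication feature: No, Steady, Up, or Down (dosage change)
glimepiride Medication feature: No, Steady, Up, or Down (dosage change)
acetohexamide Medication feature: No, Steady, Up, or Down (dosage change)
glipizide Medication feature: No, Steady, Up, or Down (dosage change)
glyburide Medication feature: No, Steady, Up, or Down (dosage change)
tolbutamide Medication feature: No, Steady, Up, or Down (dosage change)
pioglitazone Medication feature: No, Steady, Up, or Down (dosage change)
rosiglitazone Medication feature: No, Steady, Up, or Down (dosage change)
acarbose Medication feature: No, Steady, Up, or Down (dosage change)
miglitol Medication feature: No, Steady, Up, or Down (dosage change)
troglitazone Medication feature: No, Steady, Up, or Down (dosage change)
tolazamide Medication feature: No, Steady, Up, or Down (dosage change)
examide Medication feature: No, Steady, Up, or Down (dosage change)
citoglipton Medication feature: No, Steady, Up, or Down (dosage change)
insulin Medication feature: No, Steady, Up, or Down (dosage change)
glyburide-metformin Medication feature: No, Steady, Up, or Down (dosage change)
glipizide-metformin Medication feature: No, Steady, Up, or Down (dosage change)
glimepiride-pioglitazone Medication feature: No, Steady, Up, or Down (dosage change)
metformin-rosiglitazone Medication feature: No, Steady, Up, or Down (dosage change)
metformin-pioglitazone Medication feature: No, Steady, Up, or Down (dosage change)
change Whether the diabetes medication was changed (Ch) or not (No)
diabetesMed Whether any diabetes medication was prescribed (Yes/No)
readmitted Readmission to hospital (<30 days, >30 days, or NO)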
skim(mData) # Computing statistics by data type
Data summary
Name mData
Number of rows 101766
Number of columns 50
_______________________
Column type frequency:
factor 37
numeric 13
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
race 0 1 FALSE 6 Cau: 76099, Afr: 19210, ?: 2273, His: 2037
gender 0 1 FALSE 3 Fem: 54708, Mal: 47055, Unk: 3
age 0 1 FALSE 10 [70: 26068, [60: 22483, [50: 17256, [80: 17197
weight 0 1 FALSE 10 ?: 98569, [75: 1336, [50: 897, [10: 625
payer_code 0 1 FALSE 18 ?: 40256, MC: 32439, HM: 6274, SP: 5007
medical_specialty 0 1 FALSE 73 ?: 49949, Int: 14635, Eme: 7565, Fam: 7440
diag_1 0 1 FALSE 717 428: 6862, 414: 6581, 786: 4016, 410: 3614
diag_2 0 1 FALSE 749 276: 6752, 428: 6662, 250: 6071, 427: 5036
diag_3 0 1 FALSE 790 250: 11555, 401: 8289, 276: 5175, 428: 4577
max_glu_serum 0 1 FALSE 4 Non: 96420, Nor: 2597, >20: 1485, >30: 1264
A1Cresult 0 1 FALSE 4 Non: 84748, >8: 8216, Nor: 4990, >7: 3812
metformin 0 1 FALSE 4 No: 81778, Ste: 18346, Up: 1067, Dow: 575
repaglinide 0 1 FALSE 4 No: 100227, Ste: 1384, Up: 110, Dow: 45
nateglinide 0 1 FALSE 4 No: 101063, Ste: 668, Up: 24, Dow: 11
chlorpropamide 0 1 FALSE 4 No: 101680, Ste: 79, Up: 6, Dow: 1
glimepiride 0 1 FALSE 4 No: 96575, Ste: 4670, Up: 327, Dow: 194
acetohexamide 0 1 FALSE 2 No: 101765, Ste: 1
glipizide 0 1 FALSE 4 No: 89080, Ste: 11356, Up: 770, Dow: 560
glyburide 0 1 FALSE 4 No: 91116, Ste: 9274, Up: 812, Dow: 564
tolbutamide 0 1 FALSE 2 No: 101743, Ste: 23
pioglitazone 0 1 FALSE 4 No: 94438, Ste: 6976, Up: 234, Dow: 118
rosiglitazone 0 1 FALSE 4 No: 95401, Ste: 6100, Up: 178, Dow: 87
acarbose 0 1 FALSE 4 No: 101458, Ste: 295, Up: 10, Dow: 3
miglitol 0 1 FALSE 4 No: 101728, Ste: 31, Dow: 5, Up: 2
troglitazone 0 1 FALSE 2 No: 101763, Ste: 3
tolazamide 0 1 FALSE 3 No: 101727, Ste: 38, Up: 1
examide 0 1 FALSE 1 No: 101766
citoglipton 0 1 FALSE 1 No: 101766
insulin 0 1 FALSE 4 No: 47383, Ste: 30849, Dow: 12218, Up: 11316
glyburide.metformin 0 1 FALSE 4 No: 101060, Ste: 692, Up: 8, Dow: 6
glipizide.metformin 0 1 FALSE 2 No: 101753, Ste: 13
glimepiride.pioglitazone 0 1 FALSE 2 No: 101765, Ste: 1
metformin.rosiglitazone 0 1 FALSE 2 No: 101764, Ste: 2
metformin.pioglitazone 0 1 FALSE 2 No: 101765, Ste: 1
change 0 1 FALSE 2 No: 54755, Ch: 47011
diabetesMed 0 1 FALSE 2 Yes: 78363, No: 23403
readmitted 0 1 FALSE 3 NO: 54864, >30: 35545, <30: 11357

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
encounter_id 0 1 165201645.62 102640295.98 12522 84961194 152388987 230270888 443867222 ▆▇▅▂▂
patient_nbr 0 1 54330400.69 38696359.35 135 23413221 45505143 87545950 189502619 ▇▆▆▁▁
admission_type_id 0 1 2.02 1.45 1 1 1 3 8 ▇▂▁▁▁
discharge_disposition_id 0 1 3.72 5.28 1 1 1 4 28 ▇▁▁▁▁
admission_source_id 0 1 5.75 4.06 1 1 7 7 25 ▅▇▁▁▁
time_in_hospital 0 1 4.40 2.99 1 2 4 6 14 ▇▅▂▁▁
num_lab_procedures 0 1 43.10 19.67 1 31 44 57 132 ▃▇▅▁▁
num_procedures 0 1 1.34 1.71 0 0 1 2 6 ▇▂▁▁▁
num_medications 0 1 16.02 8.13 1 10 15 20 81 ▇▃▁▁▁
number_outpatient 0 1 0.37 1.27 0 0 0 0 42 ▇▁▁▁▁
number_emergency 0 1 0.20 0.93 0 0 0 0 76 ▇▁▁▁▁
number_inpatient 0 1 0.64 1.26 0 0 0 1 21 ▇▁▁▁▁
number_diagnoses 0 1 7.42 1.93 1 6 8 9 16 ▁▅▇▁▁
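
Note that skim reports n_missing = 0 for every column even though `?` appears among the top counts of race, weight, payer_code and medical_specialty: the missing values are stored as placeholder strings, not as NA. A minimal sketch of how such placeholders could be counted per column is shown here. The data frame `toy` and its values are made up for illustration and are not taken from the data set.

```r
# Toy data frame (hypothetical values) with "?" and "Unknown/Invalid" placeholders
toy <- data.frame(
  race = factor(c("Caucasian", "?", "AfricanAmerican", "?")),
  gender = factor(c("Female", "Male", "Unknown/Invalid", "Male")),
  time_in_hospital = c(3L, 2L, 5L, 1L)
)

placeholders <- c("?", "Unknown/Invalid")

# Count placeholder entries in each column
placeholder_counts <- sapply(toy, function(col) sum(col %in% placeholders))
print(placeholder_counts)
```

Running the same `sapply()` call on mData would be one way to see the true amount of missing data before any cleaning.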

Data preparation

Transforming data type

# Converting variables with categorical values into factors
mData$admission_type_id <- as.factor(mData$admission_type_id) 
mData$discharge_disposition_id <- as.factor(mData$discharge_disposition_id)
mData$admission_source_id <- as.factor(mData$admission_source_id)
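
The three conversions above could also be written in a single step. The following is only a sketch on a made-up data frame that reuses the same column names; `toy` and `id_cols` are illustrative names.

```r
# Toy data frame (hypothetical values) with integer-coded categorical columns
toy <- data.frame(
  admission_type_id = c(1L, 3L, 1L),
  discharge_disposition_id = c(1L, 11L, 6L),
  admission_source_id = c(7L, 1L, 7L),
  time_in_hospital = c(3L, 2L, 5L)
)

# Columns whose integers are category codes rather than quantities
id_cols <- c("admission_type_id", "discharge_disposition_id", "admission_source_id")

# Convert all of them to factors at once; other columns are left untouched
toy[id_cols] <- lapply(toy[id_cols], as.factor)
str(toy)
```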

Replace/remove missing values

# Replace "?" and "Unknown/Invalid" placeholders in factor columns with NA
# and print the number of replaced values per column

for (i in seq_len(ncol(mData))) {
  if (is.factor(mData[, i])) {
    # Row positions of the placeholder values in this column
    missing_idx <- which(mData[, i] %in% c("?", "Unknown/Invalid"))
    mData[missing_idx, i] <- NA
    if (length(missing_idx) > 0) {
      print(c(colnames(mData)[i], length(missing_idx)))
    }
  }
}
## [1] "race" "2273"
## [1] "gender" "3"     
## [1] "weight" "98569" 
## [1] "payer_code" "40256"     
## [1] "medical_specialty" "49949"            
## [1] "diag_1" "21"    
## [1] "diag_2" "358"   
## [1] "diag_3" "1423"
dim(mData)
## [1] 101766     50
# Heat map to see missing data of variables
heatmap(1 * is.na(mData), Rowv = NA, Colv = NA)
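
As an alternative to replacing the placeholders after loading, they could be declared as missing at import time with the `na.strings` argument of `read.csv()`. The sketch below uses a few made-up CSV rows passed through the `text` argument; the column values are illustrative only. It also computes the share of missing values per column, which gives the same information as the heat map in numeric form.

```r
# Made-up CSV content with placeholder strings for missing values
csv_text <- "race,gender,weight
Caucasian,Female,?
?,Male,?
AfricanAmerican,Unknown/Invalid,[75-100)
Caucasian,Male,?"

toy <- read.csv(
  text = csv_text,
  na.strings = c("?", "Unknown/Invalid"),  # treated as NA while reading
  stringsAsFactors = TRUE
)

# Share of missing values per column
missing_share <- colMeans(is.na(toy))
print(round(missing_share, 2))

# "?" never becomes a factor level because it was converted on import
print(levels(toy$race))
```

With this approach the `?` level would not linger in the factor levels, as it does in the `str()` output of the cleaned data.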

mData$x <- NULL # Remove the empty first column, if present

mData$medical_specialty <- NULL # Could be kept for descriptive classification, but it is dropped because it is not used in this analysis

mData$weight <- NULL # Removed: most values are unavailable (privacy concerns)

mData$encounter_id <- NULL # Identifier only; nothing is analyzed from it

mData$payer_code <- NULL # Removed: many values are unavailable (privacy concerns)

mData$examide <- NULL # Constant: has only one value

mData$citoglipton <- NULL # Constant: has only one value

# mData[complete.cases(mData), ] # Displays all instances with complete data

# mData[!complete.cases(mData), ] # Displays all instances with NA in any variable

mDatao <- na.omit(mData) # Omitting all the instances where values are NA

dim(mDatao) # Updated dimension of data
## [1] 98052    44
str(mDatao) # Updated Data statistics
## 'data.frame':    98052 obs. of  44 variables:
##  $ patient_nbr             : int  55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 89869032 ...
##  $ race                    : Factor w/ 6 levels "?","AfricanAmerican",..: 4 2 4 4 4 4 4 4 4 2 ...
##  $ gender                  : Factor w/ 3 levels "Female","Male",..: 1 1 2 2 2 2 2 1 1 1 ...
##  $ age                     : Factor w/ 10 levels "[0-10)","[10-20)",..: 2 3 4 5 6 7 8 9 10 5 ...
##  $ admission_type_id       : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 2 3 1 2 3 1 ...
##  $ discharge_disposition_id: Factor w/ 26 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 3 1 ...
##  $ admission_source_id     : Factor w/ 17 levels "1","2","3","4",..: 7 7 7 7 2 2 7 4 4 7 ...
##  $ time_in_hospital        : int  3 2 2 1 3 4 5 13 12 9 ...
##  $ num_lab_procedures      : int  59 11 44 51 31 70 73 68 33 47 ...
##  $ num_procedures          : int  0 5 1 0 6 1 0 2 3 2 ...
##  $ num_medications         : int  18 13 16 8 16 21 12 28 18 17 ...
##  $ number_outpatient       : int  0 2 0 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ diag_1                  : Factor w/ 717 levels "?","10","11",..: 145 456 556 56 265 265 278 254 284 122 ...
##  $ diag_2                  : Factor w/ 749 levels "?","11","110",..: 81 80 99 26 248 248 316 262 48 243 ...
##  $ diag_3                  : Factor w/ 790 levels "?","11","110",..: 123 768 250 88 88 772 88 231 319 668 ...
##  $ number_diagnoses        : int  9 6 7 5 9 7 8 8 8 9 ...
##  $ max_glu_serum           : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 3 2 2 2 2 ...
##  $ repaglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ nateglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ chlorpropamide          : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glimepiride             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 3 2 2 2 2 ...
##  $ acetohexamide           : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 2 3 2 3 2 2 2 3 2 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ tolbutamide             : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 3 2 ...
##  $ acarbose                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ miglitol                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ troglitazone            : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tolazamide              : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 4 2 4 3 3 3 2 3 3 3 ...
##  $ glyburide.metformin     : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glipizide.metformin     : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.pioglitazone  : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 1 2 1 1 2 1 2 1 1 2 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ readmitted              : Factor w/ 3 levels "<30",">30","NO": 2 3 3 3 2 3 2 3 3 2 ...
##  - attr(*, "na.action")= 'omit' Named int  1 20 21 22 55 66 67 88 100 112 ...
##   ..- attr(*, "names")= chr  "1" "20" "21" "22" ...
# library(heatmaply)
# heatmaply_na(mDatao,showticklabels = c(TRUE, FALSE))
# 
#  round(cor(mDatao),2)%>%
#  knitr::kable(caption = "", digits = 3)%>%
#    kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE,position = "left")%>%
#    row_spec(0, bold = T)%>%
#    column_spec(1,bold = TRUE, italic = TRUE)

Grouping/Collapsing features of variables

# The variable discharge_disposition_id records where the patient went after being discharged from the hospital. Codes 11, 13, 14, 19, 20 and 21 relate to death or hospice, so these records are removed because such patients cannot be readmitted.

par(mfrow = c(1,2))
barplot(table(mDatao$discharge_disposition_id), main = "Before dropping")

mDatao <- mDatao[!mDatao$discharge_disposition_id %in% c(11,13,14,19,20,21), ]
barplot(table(mDatao$discharge_disposition_id), main = "After dropping")
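
One detail of the filtering step is that subsetting a data frame does not drop unused factor levels, so the removed codes still appear as zero-count categories in `table()` and in bar plots. A small sketch on made-up data (`toy`, `kept` and `death_or_hospice` are illustrative names) shows the effect and how `droplevels()` could clean it up:

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  discharge_disposition_id = factor(c("1", "11", "6", "13", "1")),
  time_in_hospital = c(3L, 9L, 5L, 7L, 2L)
)

death_or_hospice <- c("11", "13", "14", "19", "20", "21")

# Keep only encounters that did not end in death or hospice
kept <- toy[!toy$discharge_disposition_id %in% death_or_hospice, ]

# The removed codes are still present as empty levels
print(table(kept$discharge_disposition_id))

# Drop the empty levels
kept$discharge_disposition_id <- droplevels(kept$discharge_disposition_id)
print(table(kept$discharge_disposition_id))
```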

# Rename admission_type_id to admission_type, then collapse its levels by merging similar admission types
colnames(mDatao)[colnames(mDatao) == "admission_type_id"] <- "admission_type"
barplot(table(mDatao$admission_type))

mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 2, 1)
mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 8, 5)
mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 6, 5)
mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 7, 1)

barplot(table(mDatao$admission_type), main = "Admission types after data collapsing")

# Replace the numeric codes with descriptive level names
mDatao$admission_type <- str_replace(mDatao$admission_type,"1","Emergency")
mDatao$admission_type <- str_replace(mDatao$admission_type,"5","Other")
mDatao$admission_type <- str_replace(mDatao$admission_type,"3","Elective")
mDatao$admission_type <- str_replace(mDatao$admission_type,"4","Newborn")

mDatao$admission_type <- as.factor(mDatao$admission_type)
barplot(table(mDatao$admission_type))
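
The chained `str_replace()` calls work here because every code is a single digit and none of the new labels contains a digit. A named lookup vector is one alternative that collapses and relabels in a single step without relying on that property. This is a sketch on made-up codes; `ids` and `admission_lookup` are illustrative names, and the grouping simply mirrors the one used above.

```r
# One example encounter per admission type code (hypothetical data)
ids <- factor(c(1, 2, 3, 4, 5, 6, 7, 8))

# Lookup from original code to collapsed label
admission_lookup <- c(
  "1" = "Emergency", "2" = "Emergency", "7" = "Emergency",
  "3" = "Elective",
  "4" = "Newborn",
  "5" = "Other", "6" = "Other", "8" = "Other"
)

# Index the lookup by the code (as character) and rebuild the factor
admission_type <- factor(unname(admission_lookup[as.character(ids)]))
print(table(admission_type))
```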

# Rename the variable "admission_source_id" to "admission_source"
colnames(mDatao)[colnames(mDatao) == "admission_source_id"] <- "admission_source"
barplot(table(mDatao$admission_source))

# Group/collapse the levels of admission_source based on their similarity
mDatao$admission_source <- case_when(
  mDatao$admission_source %in% c("1","2","3") ~ "Physician Referral",
  mDatao$admission_source %in% c("4","5","6","8","9","10","11","12","13","14","15","17","18","19","20","21","22","23","24","25","26") ~ "Other",
  TRUE ~ "Emergency Room"
)

mDatao$admission_source <- as.factor(mDatao$admission_source)
barplot(table(mDatao$admission_source), main = "Post collapsing and changing type of admission")

# Rename the column "discharge_disposition_id" to "discharge_disposition"
colnames(mDatao)[colnames(mDatao) == "discharge_disposition_id"] <- "discharge_disposition"
barplot(table(mDatao$discharge_disposition))

# Collapse discharge_disposition into two levels: "Home" (code 1) and "Other"
mDatao$discharge_disposition <- case_when(mDatao$discharge_disposition %in% "1" ~ "Home", TRUE ~ "Other")

mDatao$discharge_disposition <- as.factor(mDatao$discharge_disposition)
barplot(table(mDatao$discharge_disposition), main = "After collapsing and changing the type")

Categorization of features of variables

mDatao$diag_1 <- as.character(mDatao$diag_1)
# All diagnosis values are ICD-9 codes; they are grouped here by the type of condition found in the diagnosis
mDatao <- mutate(mDatao, primary_diagnosis = ifelse(str_detect(diag_1, "V") | str_detect(diag_1, "E"), "Other",
ifelse(str_detect(diag_1, "250"), "Diabetes",
ifelse((as.integer(diag_1) >= 390 & as.integer(diag_1) <= 459) | as.integer(diag_1) == 785, "Circulatory",
ifelse((as.integer(diag_1) >= 460 & as.integer(diag_1) <= 519) | as.integer(diag_1) == 786, "Respiratory",
ifelse((as.integer(diag_1) >= 520 & as.integer(diag_1) <= 579) | as.integer(diag_1) == 787, "Digestive",
ifelse((as.integer(diag_1) >= 580 & as.integer(diag_1) <= 629) | as.integer(diag_1) == 788, "Genitourinary",
ifelse((as.integer(diag_1) >= 140 & as.integer(diag_1) <= 239), "Neoplasms",
ifelse((as.integer(diag_1) >= 710 & as.integer(diag_1) <= 739), "Musculoskeletal",
ifelse((as.integer(diag_1) >= 800 & as.integer(diag_1) <= 999), "Injury", "Other"))))))))))
## Warning: Problem with `mutate()` input `primary_diagnosis`.
## i NAs introduced by coercion
## i Input `primary_diagnosis` is `ifelse(...)`.
## Warning in ifelse((as.integer(diag_1) >= 390 & as.integer(diag_1) <= 459) | :
## NAs introduced by coercion
mDatao$primary_diagnosis <- as.factor(mDatao$primary_diagnosis)

table(mDatao$primary_diagnosis)
## 
##     Circulatory        Diabetes       Digestive   Genitourinary          Injury 
##           28887            7870            9045            4870            6590 
## Musculoskeletal       Neoplasms           Other     Respiratory 
##            4717            3013           17169           13511
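
The coercion warnings above come from calling `as.integer()` on V and E codes, which are not numeric. The following is a sketch of the same grouping written as a standalone function that converts each code once and suppresses the expected warning. `icd9_group` is a hypothetical helper, not part of the original analysis, and it assumes that diabetes codes are exactly those starting with "250", whereas the code above matches "250" anywhere in the string.

```r
# Map ICD-9 codes (character or factor) to broad diagnosis groups
icd9_group <- function(code) {
  code <- as.character(code)
  # V and E codes are not numeric: coerce quietly and let them fall through to "Other"
  num <- suppressWarnings(floor(as.numeric(code)))
  in_range <- function(lo, hi) !is.na(num) & num >= lo & num <= hi
  is_code <- function(value) !is.na(num) & num == value

  group <- rep("Other", length(code))
  group[in_range(390, 459) | is_code(785)] <- "Circulatory"
  group[in_range(460, 519) | is_code(786)] <- "Respiratory"
  group[in_range(520, 579) | is_code(787)] <- "Digestive"
  group[in_range(580, 629) | is_code(788)] <- "Genitourinary"
  group[in_range(140, 239)] <- "Neoplasms"
  group[in_range(710, 739)] <- "Musculoskeletal"
  group[in_range(800, 999)] <- "Injury"
  # Assumption: diabetes codes are those starting with "250" (e.g. "250.83")
  group[!is.na(code) & startsWith(code, "250")] <- "Diabetes"
  group
}

example_codes <- c("428", "250.83", "V57", "786", "E909", "722", "995", "153", "585", "530")
example_groups <- icd9_group(example_codes)
print(data.frame(code = example_codes, group = example_groups))
```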
#removing "diag variables"
mDatao$diag_1 <- NULL
mDatao$diag_2 <- NULL
mDatao$diag_3 <- NULL
barplot(table(mDatao$age))

#I am regrouping the "age" to [0-40],[40-50],[50-60],[60-70],[70-80],[80-100]
mDatao$age <- case_when(mDatao$age %in% c("[0-10)","[10-20)","[20-30)","[30-40)") ~ "[0-40]",
                       mDatao$age %in% c("[80-90)","[90-100)") ~ "[80-100]",
                       mDatao$age %in% "[40-50)" ~ "[40-50]",
                       mDatao$age %in% "[50-60)" ~ "[50-60]",
                       mDatao$age %in% "[60-70)" ~ "[60-70]", TRUE ~ "[70-80]")

barplot(table(mDatao$age), main = "Regrouped Age")

mDatao$age <- as.factor(mDatao$age)
# Recode "readmitted" as binary: 1 if the patient was readmitted within 30 days, 0 if the readmission was after 30 days or there was no readmission
mDatao$readmitted <- case_when(mDatao$readmitted %in% c(">30","NO") ~ "0", TRUE ~ "1")
mDatao$readmitted <- as.factor(mDatao$readmitted)
levels(mDatao$readmitted)
## [1] "0" "1"
# For patients with multiple encounters, keep only the first record
mDatao <- mDatao[!duplicated(mDatao$patient_nbr),]
# The patient identifier "patient_nbr" is no longer needed
mDatao$patient_nbr <- NULL
dim(mDatao)
## [1] 67128    41
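
The deduplication relies on `duplicated()` marking every occurrence of a value after the first one, so the first encounter in row order is the one that is kept. A small sketch on made-up data (`toy` and `first_encounters` are illustrative names):

```r
# Toy data frame (hypothetical values): patients 101 and 102 appear twice
toy <- data.frame(
  patient_nbr = c(101L, 102L, 101L, 103L, 102L),
  time_in_hospital = c(3L, 2L, 7L, 4L, 1L)
)

# duplicated() is FALSE for the first occurrence and TRUE for later ones
first_encounters <- toy[!duplicated(toy$patient_nbr), ]
print(first_encounters)
```

If the rows are not sorted by encounter date, "first" only means first in the file, which may be worth checking before relying on this step.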

Removing outliers

# Use box plots to identify the variables that have outliers
par(mfrow = c(2,4))
boxplot(mDatao$time_in_hospital, main = "time_in_hospital")
boxplot(mDatao$number_outpatient, main = "number_outpatient")
boxplot(mDatao$number_emergency, main = "number_emergency")
boxplot(mDatao$num_lab_procedures, main = "num_lab_procedures")
boxplot(mDatao$number_diagnoses, main = "number_diagnoses")
boxplot(mDatao$number_inpatient, main = "number_inpatient")
boxplot(mDatao$num_procedures, main = "num_procedures")
boxplot(mDatao$num_medications, main = "num_medications")

# These three variables have widely scattered values, so they are removed
mDatao$number_emergency <- NULL
mDatao$number_inpatient <- NULL
mDatao$number_outpatient <- NULL

# Remove every row that has an outlier in any integer column (1.5 * IQR rule)
outliers_remover <- function(df){
  drop <- rep(FALSE, nrow(df))  # rows flagged for removal
  for(i in seq_len(ncol(df))){
    if(is.integer(df[,i])){
      q1 <- quantile(df[,i], 0.25, na.rm = TRUE)
      q3 <- quantile(df[,i], 0.75, na.rm = TRUE)
      iqr <- q3 - q1
      upper <- q3 + 1.5 * iqr
      lower <- q1 - 1.5 * iqr
      outside <- df[,i] > upper | df[,i] < lower
      # NA values are never treated as outliers
      drop <- drop | (!is.na(outside) & outside)
    }
  }
  # Logical indexing also works when no row is flagged
  df[!drop,]
}
mDatao <- outliers_remover(mDatao)
pairs.panels(mDatao[c("time_in_hospital", "num_lab_procedures", "num_procedures", "num_medications", "number_diagnoses")])
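
To make the 1.5 * IQR rule used by outliers_remover concrete, here is a worked example on a small made-up vector. The values are illustrative only; `quantile()` is used with its default interpolation, as in the function above.

```r
# Made-up values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q1 <- quantile(x, 0.25, names = FALSE)  # 3.25
q3 <- quantile(x, 0.75, names = FALSE)  # 7.75
iqr <- q3 - q1                          # 4.5

lower <- q1 - 1.5 * iqr                 # -3.5
upper <- q3 + 1.5 * iqr                 # 14.5

is_outlier <- x < lower | x > upper
print(c(lower = lower, upper = upper))
print(x[is_outlier])                    # 100
```

Because the rule is applied to each integer column separately and a row is dropped if any column flags it, the number of removed rows can be larger than any single box plot suggests.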

table(mDatao$readmitted)
## 
##     0     1 
## 55628  5559
mDatao$repaglinide <- NULL
mDatao$nateglinide <- NULL
mDatao$chlorpropamide <-NULL
mDatao$acetohexamide <- NULL
mDatao$tolbutamide <- NULL
mDatao$acarbose <- NULL
mDatao$miglitol <- NULL
mDatao$troglitazone <- NULL
mDatao$tolazamide <- NULL
mDatao$glyburide.metformin <- NULL
mDatao$glipizide.metformin <- NULL
mDatao$glimepiride.pioglitazone <- NULL
mDatao$metformin.rosiglitazone <- NULL
mDatao$metformin.pioglitazone <- NULL

dim(mDatao)
## [1] 61187    24
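
The readmitted counts printed above imply a strong class imbalance, which is worth keeping in mind for feature selection and modelling. A short calculation using those two counts:

```r
# Counts taken from the table(mDatao$readmitted) output above
readmitted_counts <- c("0" = 55628, "1" = 5559)

# Share of each class among the 61187 remaining records
readmitted_share <- readmitted_counts / sum(readmitted_counts)
print(round(readmitted_share, 3))
```

Only about 9% of the remaining records are readmissions within 30 days, so a model that always predicts 0 would already be right about 91% of the time. Accuracy alone is therefore a weak yardstick for this target.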

Features of medicines that can be removed as suggested by Boruta

Feature selection

R offers several techniques, such as Boruta and MARS, that help identify important variables (reference: http://r-statistics.co/Variable-Selection-and-Importance-With-R.html)
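
Before the full run, it may help to see how a Boruta result can be inspected. The sketch below runs Boruta on the built-in iris data as a stand-in for the real data; `boruta_toy` is an illustrative name, and the `maxRuns = 50` setting is an arbitrary choice for a quick run. It assumes the Boruta package is installed.

```r
library(Boruta)

# Ensure the result is repeatable
set.seed(100)

# iris ships with R; it stands in for the real data here
boruta_toy <- Boruta(Species ~ ., data = iris, doTrace = 0, maxRuns = 50)

# Summary of confirmed, tentative and rejected attributes
print(boruta_toy)

# Per-attribute importance statistics and the final decision
print(attStats(boruta_toy))

# Names of the confirmed attributes (tentative ones excluded)
print(getSelectedAttributes(boruta_toy, withTentative = FALSE))
```

Setting `doTrace = 0` keeps the console quiet; the run below uses `doTrace = 2`, which prints the progress of every iteration.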

# ensure results are repeatable
set.seed(100)
boruta <- Boruta(readmitted ~., data = mDatao, doTrace = 2)
## After 12 iterations, +15 mins:
##  confirmed 16 attributes: A1Cresult, admission_source, admission_type, age, change and 11 more;
##  rejected 2 attributes: glimepiride, glyburide;
##  still have 5 attributes left.
## After 16 iterations, +20 mins:
##  rejected 2 attributes: glipizide, rosiglitazone;
##  still have 3 attributes left.
## After 26 iterations, +29 mins:
##  confirmed 1 attribute: diabetesMed;
##  still have 2 attributes left.
## After 34 iterations, +37 mins:
##  rejected 1 attribute: pioglitazone;
##  still have 1 attribute left.
##  35. run of importance source...
## Computing permutation importance.. Progress: 90%. Estimated remaining time: 3 seconds.
##  36. run of importance source...
## Computing permutation importance.. Progress: 90%. Estimated remaining time: 3 seconds.
##  37. run of importance source...
## Computing permutation importance.. Progress: 73%. Estimated remaining time: 11 seconds.
##  38. run of importance source...
## Computing permutation importance.. Progress: 82%. Estimated remaining time: 6 seconds.
##  39. run of importance source...
## Computing permutation importance.. Progress: 84%. Estimated remaining time: 5 seconds.
##  40. run of importance source...
## Computing permutation importance.. Progress: 84%. Estimated remaining time: 5 seconds.
##  41. run of importance source...
## Computing permutation importance.. Progress: 84%. Estimated remaining time: 5 seconds.
##  42. run of importance source...
## Computing permutation importance.. Progress: 83%. Estimated remaining time: 6 seconds.
##  43. run of importance source...
## Computing permutation importance.. Progress: 79%. Estimated remaining time: 8 seconds.
##  44. run of importance source...
## Computing permutation importance.. Progress: 84%. Estimated remaining time: 6 seconds.
##  45. run of importance source...
## Computing permutation importance.. Progress: 82%. Estimated remaining time: 6 seconds.
##  46. run of importance source...
## Computing permutation importance.. Progress: 83%. Estimated remaining time: 6 seconds.
##  47. run of importance source...
## Computing permutation importance.. Progress: 84%. Estimated remaining time: 5 seconds.
##  48. run of importance source...
## Computing permutation importance.. Progress: 78%. Estimated remaining time: 8 seconds.
##  49. run of importance source...
## Computing permutation importance.. Progress: 80%. Estimated remaining time: 7 seconds.
##  50. run of importance source...
## Computing permutation importance.. Progress: 81%. Estimated remaining time: 7 seconds.
##  51. run of importance source...
## Computing permutation importance.. Progress: 81%. Estimated remaining time: 7 seconds.
## After 51 iterations, +53 mins:
##  confirmed 1 attribute: gender;
##  no more attributes left.
plot(boruta, las = 2, cex.axis = 0.5)

plotImpHistory(boruta)

attStats(boruta)
##                          meanImp  medianImp     minImp     maxImp   normHits
## race                   5.9473145  5.8424548  3.7600833  8.0492192 1.00000000
## gender                 2.7864856  2.6946927  0.5195113  5.2730131 0.74509804
## age                   22.5000512 22.5729643 18.3341191 26.2400991 1.00000000
## admission_type        23.0409482 23.2776984 18.4581737 26.1686836 1.00000000
## discharge_disposition 29.5168999 29.5817939 25.4113431 34.4669588 1.00000000
## admission_source      24.8510776 24.8877501 21.4608738 28.3055906 1.00000000
## time_in_hospital      30.8222395 30.9236807 26.6336428 35.4899781 1.00000000
## num_lab_procedures    26.9177497 27.2026971 22.9035450 30.1815857 1.00000000
## num_procedures        20.1157786 19.7861094 16.8729665 23.5431608 1.00000000
## num_medications       33.1378005 33.1587865 29.0765147 37.1821540 1.00000000
## number_diagnoses      22.9835165 23.1118335 19.3429959 27.9644279 1.00000000
## max_glu_serum         15.8074993 15.8644383 12.7864356 17.8362448 1.00000000
## A1Cresult             11.6920435 11.5936744  9.9718235 13.7780133 1.00000000
## metformin              8.5465352  8.6001694  6.7347819 10.3895841 1.00000000
## glimepiride           -0.8051658 -0.9638350 -2.5045546  0.8300808 0.00000000
## glipizide              0.8175565  0.9509885 -0.6958722  2.0485499 0.01960784
## glyburide             -2.6147356 -2.3578022 -4.9575220 -1.3527214 0.00000000
## pioglitazone           0.8642983  0.7435128 -1.6428089  3.3264061 0.13725490
## rosiglitazone          0.3221866  0.3920429 -1.6624047  1.9048797 0.01960784
## insulin                6.7471171  7.0152141  4.0320129  9.2008328 1.00000000
## change                13.3461142 12.9683995 11.1588505 16.1669477 1.00000000
## diabetesMed            7.4033560  8.3950637  0.4972511 11.4332937 0.92156863
## primary_diagnosis     17.7373911 17.6131008 15.8138216 21.0660926 1.00000000
##                        decision
## race                  Confirmed
## gender                Confirmed
## age                   Confirmed
## admission_type        Confirmed
## discharge_disposition Confirmed
## admission_source      Confirmed
## time_in_hospital      Confirmed
## num_lab_procedures    Confirmed
## num_procedures        Confirmed
## num_medications       Confirmed
## number_diagnoses      Confirmed
## max_glu_serum         Confirmed
## A1Cresult             Confirmed
## metformin             Confirmed
## glimepiride            Rejected
## glipizide              Rejected
## glyburide              Rejected
## pioglitazone           Rejected
## rosiglitazone          Rejected
## insulin               Confirmed
## change                Confirmed
## diabetesMed           Confirmed
## primary_diagnosis     Confirmed
boruta
## Boruta performed 51 iterations in 53.09086 mins.
##  18 attributes confirmed important: A1Cresult, admission_source,
## admission_type, age, change and 13 more;
##  5 attributes confirmed unimportant: glimepiride, glipizide, glyburide,
## pioglitazone, rosiglitazone;
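
The decision table above can be reduced to a list of predictors to carry into modelling. The following is a minimal sketch in base R, not part of the original analysis. To run on its own it rebuilds a handful of rows from the printed `attStats(boruta)` output by hand; the names `stats`, `confirmed`, `selected` and `hits` are illustrative. It also shows how `normHits` relates to the 51 importance runs reported above.

```r
# A few rows copied from the attStats(boruta) output above.
# In the real workflow this would simply be: stats <- attStats(boruta)
stats <- data.frame(
  meanImp  = c(33.1378005, 30.8222395, 2.7864856, 0.8642983, -2.6147356),
  normHits = c(1, 1, 0.74509804, 0.13725490, 0),
  decision = c("Confirmed", "Confirmed", "Confirmed", "Rejected", "Rejected"),
  row.names = c("num_medications", "time_in_hospital", "gender",
                "pioglitazone", "glyburide"),
  stringsAsFactors = FALSE
)

# Keep only the confirmed attributes, ordered by mean importance (highest first)
confirmed <- stats[stats$decision == "Confirmed", ]
confirmed <- confirmed[order(-confirmed$meanImp), ]
selected <- rownames(confirmed)
print(selected)

# normHits is the number of hits divided by the number of importance runs,
# so multiplying by the 51 runs recovers the raw hit counts
hits <- round(stats$normHits * 51)
names(hits) <- rownames(stats)
print(hits)
```

On the full `boruta` object, `getSelectedAttributes(boruta, withTentative = FALSE)` from the Boruta package should return the confirmed attribute names directly.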

Data segregation

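I am splitting the data into 80% for training and 20% for testing. createDataPartition samples within each level of readmitted, so both sets keep roughly the same class proportions.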

set.seed(100)
train <- createDataPartition(mDatao$readmitted, p = 0.8, list = FALSE)
training <- mDatao[train, ]
testing <- mDatao[-train, ]
# Check the class distribution of the dependent variable in the training set
table(training$readmitted)
## 
##     0     1 
## 44503  4448

K-fold cross-validation

For validation, I plan to use k-fold cross-validation. I still need to understand how this technique works, so I have not created a separate validation set yet.
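
To show how k-fold cross-validation works, here is a minimal base R sketch. It is not the validation used in this report. It runs on a small simulated data set (`toy`, with hypothetical predictors `x1` and `x2`) rather than on `training`, and it uses plain logistic regression as a placeholder model. The data are split into k folds, each fold is held out once while the model is fitted on the other k - 1 folds, and the k held-out scores are averaged.

```r
set.seed(100)

# Hypothetical stand-in data: a binary outcome with a small share of positives
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-2.5 + 0.8 * toy$x1))

k <- 5
# Assign every row to one of k equally sized folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  fit_rows <- toy[fold_id != i, ]   # k - 1 folds used for fitting
  held_out <- toy[fold_id == i, ]   # remaining fold used for scoring

  fit  <- glm(y ~ x1 + x2, data = fit_rows, family = binomial)
  prob <- predict(fit, newdata = held_out, type = "response")
  pred <- as.integer(prob > 0.5)

  fold_accuracy[i] <- mean(pred == held_out$y)
}

# One score per fold, then the cross-validated average
print(fold_accuracy)
print(mean(fold_accuracy))
```

Since caret is already used for the train/test split, the same procedure could probably be delegated to `trainControl(method = "cv", number = 10)` passed to `train()`. The random fold assignment here is not stratified, which matters when one class is rare.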

Models and their evaluation